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Multiple Testing and Error Control in 
Gaussian Graphical Model Selection 



Mathias Drton and Michael D. Perlman 



Abstract. Graphical models provide a framework for exploration of 
multivariate dependence patterns. The connection between graph and 
statistical model is made by identifying the vertices of the graph with 
the observed variables and translating the pattern of edges in the graph 
into a pattern of conditional independences that is imposed on the 
variables' joint distribution. Focusing on Gaussian models, we review 
classical graphical models. For these models the defining conditional 
independences are equivalent to vanishing of certain (partial) correla- 
tion coefficients associated with individual edges that are absent from 
the graph. Hence, Gaussian graphical model selection can be performed 
by multiple testing of hypotheses about vanishing (partial) correlation 
coefficients. We show and exemplify how this approach allows one to 
perform model selection while controlling error rates for incorrect edge 
inclusion. 

Key words and phrases: Acyclic directed graph, Bayesian network, 
bidirected graph, chain graph, concentration graph, covariance graph, 
DAG, graphical model, multiple testing, undirected graph. 



1. INTRODUCTION 

Many models from multivariate statistics are spec- 
ified by combining hypotheses of (conditional) in- 
dependence with particular distributional assump- 
tions. In order to represent such models in a way 
that is easy to visualize and communicate, it is natu- 
ral to draw a graph with one vertex for each variable 
and an edge between any two variables that exhibit 
a desired type of dependence. In graphical mod- 
eling (Cox and Wermuth (1996), Edwards (2000), 
Lauritzen (1996), Whittaker (1990), Studeny (2005)), 
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a rigorous version of this idea is used to associate a 
statistical model with a graph. Via so-called Markov 
properties, the pattern of edges in the graph is trans- 
lated into conditional independence statements, 
which are then imposed on the joint distribution 
of the variables that are identified with the graph's 
vertices. In this process, different graphs with dif- 
ferent types of edges have been equipped with dif- 
ferent Markov properties. For example, graphs with 
undirected edges have been given an interpretation 
that requires two variables that are not joined by an 
edge to be conditionally independent given all other 
variables. Markov chains, Markov random fields and 
certain types of hierarchical log-linear models are 
examples of models that can be represented in this 
way. Graphs with directed edges have been used to 
encode dependence structures that arise from cause- 
effect relationships among variables (Lauritzen (2001), 
Pearl (2000), Spirtes, Glymour and Scheines (2000)), 
and the associated directed graphical models are 
also known as Bayesian networks. Other types of 
graphs, sometimes featuring different types of edges 
simultaneously, have been used to represent other 
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dependence structures. We note that in graphs fea- 
turing directed edges, directed cycles, which at an 
intuitive level correspond to feedback loops, are typ- 
ically forbidden. 

Much of the success of graphical models in ap- 
plications, such as the classic application in proba- 
bilistic expert systems (Cowell et al. (1999)), is due 
to favorable computational properties. The indepen- 
dences imposed on the distributions in a graphical 
model typically induce factorizations of joint densi- 
ties into smaller, more tractable pieces. The graph- 
ical representation of the model then helps to orga- 
nize computations with these pieces in order to solve 
statistical inference problems efficiently 
(Jordan (2004)). In fact, the models are sometimes 
defined as families of distributions with densities fac- 
toring according to a given graph (see, e.g., Jensen 
(2001)). Conditional independences are then viewed 
as consequences of such density factorization. 

Recently, graphical models have been applied fre- 
quently to the analysis of biological data (see, e.g., 
Beerenwinkel and Drton (2007), Jojic et al. (2004), 
Lauritzen and Sheehan (2003), McAuliffe, Pachter 
and Jordan (2004)). In particular, the abundance of 
gene expression data from microarray experiments 
has stimulated work on exploratory data analysis 
focusing on model selection. In the graphical con- 
text, this amounts to selection of the underlying 
graph which may reveal aspects of the network reg- 
ulating the expression of the genes under study; see 
Butte et al. (2000), Castelo and Roverato (2006), 
Dobra et al. (2004), Magwene and Kim (2004), 
Li and Gui (2006), de la Fuente et al. (2004), 
Matsuno et al. (2006), Wille et al. (2004), 
Schafer and Strimmer (2005) and the review by 
Friedman (2004). 

Three approaches to graphical model selection are 
commonly taken. The constraint-based approach, 
which has a long history, is the simplest and employs 
statistical tests of the model-defining conditional in- 
dependence hypotheses (Wermuth (1976), Badsberg 
(1992); Edwards and Havranek (1985), 1987; Kreiner 
(1987), Smith (1992), Spirtes, Glymour and Schemes 
(2000), Drton and Perlman (2004)). A second 
method is a score-based search in which models are 
selected by searching through the space of underly- 
ing graphs and maximizing a goodness-of-fit score 
such as the Bayesian Information Criterion (BIC) 
(Schwarz (1978)). The search is often done greed- 
ily by defining a neighborhood structure for graphs 



and terminating with a graph for which no neigh- 
boring graph achieves a higher score. While moving 
in the space of undirected graphs is straightforward 
by single-edge additions and deletions, this is less 
simple for graphs with directed edges due to the 
acyclicity conditions that are usually imposed (see, 
e.g., Chickering (2002)). Finally, the methodologi- 
cally most demanding approach to model selection 
is the Bayesian approach. It requires specification of 
appropriate prior distributions (Dawid and Lauritzen 
(1993), Roverato (2002), Roverato and Consonni 
(2004), Atay-Kayis and Massam (2005)) and com- 
putation of/sampling from the resulting posterior 
distribution on the space of models. Bayesian model 
determination has been studied for undirected and 
directed graphs (Cooper and Herskovits (1992), 
Madigan and Raftery (1994), Heckerman, Geiger 
and Chickering (1995),Giudici and Green (1999), 
Consonni and Leucari (2001), Dellaportas, Giudici 
and Roberts (2003)). 

In this paper we consider Gaussian graphical mod- 
els, which are used in particular for analysis of the 
continuous gene expression measurements. In Sec- 
tion 2 we review Gaussian graphical models based on 
undirected, bidirected and acyclic directed graphs. 
The latter graphs are also known as acyclic digraphs, 
directed acyclic graphs or, in short, DAGs. We also 
comment on other graphs that have been used to 
represent statistical models. Many of these induce 
Gaussian models that are fully specified by a pair- 
wise Markov property that associates one conditional 
independence statement with each pair of vertices 
that are nonadjacent, that is, not joined by an edge. 
For some graphs, such as undirected and bidirected 
ones, the conditioning set in such a pairwise condi- 
tional independence statement does not depend on 
the structure of the graph, which is important in our 
subsequent approach to the model selection problem 
in which the graph is unknown. For other types of 
graphs, the same may hold only if a priori informa- 
tion is available that allows one to restrict attention 
to a restricted subset of graphs. For example, for 
DAGs, this a priori information may take the form 
of a total order among the variables, which deter- 
mines the orientation of any directed edge that is 
deemed to be present in the graph. 

When the absence of edges corresponds to pair- 
wise conditional independence statements, model se- 
lection can be performed by testing each conditional 
independence statement individually. By translating 
the pattern of rejected hypotheses into a graph, one 
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obtains a constraint-based model selection method 
in which error rates for incorrect edge inclusion can 
be controlled when the multiple testing problem is 
addressed appropriately. Traditionally, such multi- 
ple testing approaches are deemed to be not very 
powerful (Smith (1992)), and instead, classical 
constraint-based methods use sequential tests in 
schemes such as, for example, stepwise forward/ 
backward selection. Such sequential procedures may 
possess greater power for determining the true graph, 
but the link between the significance level of indi- 
vidual hypothesis tests and overall error properties 
of the resulting model selection procedure is gener- 
ally not clear. However, in light of recent advances in 
multiple testing, it seems worthwhile to revisit the 
multiple testing approach to graphical model selec- 
tion. The recent progress not only provides more 
powerful multiple testing procedures but also allows 
one to control different types of error rates such as 
(generalized) family-wise error rate, tail probability 
of the proportion of false positives and false discov- 
ery rate (Sections 3.3 and 3.4). We illustrate the 
methodology in examples of exploratory data anal- 
ysis (Section 4), in which the multiple testing ap- 
proach allows us to identify the most important fea- 
tures of the observed correlation structure. Before 
concluding in Section 6, we show how prior knowl- 
edge about the absence or presence of certain edges 
can be exploited in order to test fewer and possibly 
simpler hypotheses in the model selection procedure 
(Section 5). 

2. GAUSSIAN GRAPHICAL MODELS 



tributed according to the multivariate normal distri- 
bution M p (fi, £). It is assumed throughout that the 
covariance matrix £ is nonsingular. Let G = (V, E) 
be a graph with vertex set V = {1, . . . ,p} and edge 
set E. The connection between graph and statis- 
tical model is made by identifying the vertices V 
of the graph G with the variables Y\, . . . , Y p . Then 



the edge set E induces conditional independences 
via so-called Markov properties. In order to be able 
to represent different types of dependence patterns, 
different types of graphs have been equipped with 
different Markov properties. 

2.1 Undirected Graphical Models 

Let G = (V, E) be an undirected graph, that is, all 
edges in the graph are undirected edges i — j. The 
pairwise undirected Markov property of G associates 
the conditional independence 



(2.1] 



Yi il Yj 



Y 



V\{i,j} 



with all pairs 1 < i i < j < p, for which the 

edge i — j is absent from G. For example, the pair- 
wise Markov property for the undirected graph in 
Figure 1(a) specifies that Yi _LL Y 3 | (Y 2 , Y 4 ), Y 1 _L 
J_ Y 4 | (Y 2 ,Y 3 ) and Y 2 Jl Y 3 | (Yi,Y 4 ). The Gaus- 
sian graphical model N(G) associated with the undi- 
rected graph G is defined as the family of all p- 
variate normal distributions Ap(/i, S) that obey the 
conditional independence restrictions (2.1) obtained 
from the pairwise Markov property. Since the ran- 
dom vector Y is distributed according to the mul- 
tivariate normal distribution N p (n, £), we have the 
equivalence 

(2.2) Yi Jl Yj | Y n{iJ} <=> Pij.v\{ij}=0, 

where Pij-v\{i,j} denotes the ijth partial correlation, 
that is, the correlation between Yi and Yj in their 
conditional distribution given YvAfij}- This partial 
correlation can be expressed in terms of the elements 
of the concentration = precision matrix X -1 = {c 1 - 7 }, 



Let Y = (Yi , . . . , Y p ) 1 e W be a random vector dis- (2.3) 



Pij-v\{i,j} 



-a- 



compare Lauritzen (1996, page 130). 

The model N(G) has also been called a covariance 
selection model (Dempster (1972)) and a concentra- 
tion graph model (Cox and Wermuth (1996)). The 
latter name reflects the fact that N(G) can easily be 
parametrized using the concentration matrix S^ 1 . 
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Fig. 1. (a) An undirected graph, (b) a bidirected graph and (c) an acyclic directed graph (DAG). 
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Gaussian undirected graphical models are regular 
exponential families with well-developed statistical 
methodology (Edwards (2000), Lauritzen (1996), 
Whittaker (1990)). 

Let Y be a random vector whose joint distribu- 
tion Ap(/i, E) is in the model N(G). By definition 
the pairwise conditional independences (2.1) hold 
among the components of Y. However, these gen- 
erally imply other independence relations. For ex- 
ample, if G is the graph from Figure 1(a), then the 
implied independence relations include 

(2.4) Y 1 1LY Z \Y A 
and 

(2.5) Yiil(y 3 ,*4)|Y 2 . 

One of the benefits of representing the statistical 
model graphically is that all such independence con- 
sequences can be read off the graph using a criterion 
known as the global Markov property. This criterion 
is based on paths in the graph, where a path is de- 
fined as a sequence of distinct vertices such that any 
two consecutive vertices in that sequence are joined 
by an edge. For disjoint subsets A, B and C of the 
vertex set V, the global undirected Markov property 
of G states that 

(2.6) Y A ALY B \Y 

if there does not exist a path in the graph that leads 
from a vertex in A to a vertex in B and has no 
nonendpoint vertex in C . In other words, the set of 
vertices C separates the vertices in A from those 
in B. Note that C may be the empty set, in which 
case conditional independence given Y is under- 
stood to be marginal independence of Ya and Yb- 
In the graph from Figure 1(a), there is no path from 
vertex 1 to vertex 3 (or 4) that does not go through 
vertex 2, which yields (2.5); (2.4) is obtained simi- 
larly. 

The conditional independence statements (2.1) are 
saturated in the sense that they involve all the vari- 
ables at hand. At the other end of the spectrum of 
pairwise independence statements is marginal inde- 
pendence, the graphical representation of which we 
discuss next. 

2.2 Bidirected Graphical Models 

Let G = (V,E) be a bidirected graph with edges 
drawn as i < — > j. The pairwise bidirected Markov 
property of G associates the marginal independence 

(2.7) Yi il Yj 



with all pairs (i, j), 1 < i < j < p, for which the edge 
i < — ► j is absent from G. The graph in Figure 1(b), 
for example, leads to Y\ J_L Y3, Y\ _LL Y4 and Y2 -LL 
Y3. The Gaussian graphical model N(G) associated 
with the bidirected graph G is defined as the family 
of all p-variate normal distributions JV^(//,£) that 
satisfy the marginal independence restrictions (2.7). 
Obviously, under multivariate normality, 

(2.8) YilLYj <=^ Pij =0, 

where 



denotes the ijth correlation, that is, the correlation 
between Y{ and Yj. 

The model N(G) has also been called a covari- 
ance graph model (Cox and Wermuth (1996)). We 
note that Cox and Wermuth (1993, 1996) and some 
other authors have used dashed instead of bidirected 
edges. Gaussian bidirected graphical models are 
curved exponential families, and the development of 
theory and methodology for these models is still in 
progress (Chaudhuri, Drton and Richardson (2007), 
Kauermann (1996), Wermuth, Cox and Marchetti 
(2006)). 

The duality between saturated pairwise conditional 
independence and marginal independence leads to 
a nice duality between the global Markov proper- 
ties for undirected and bidirected Gaussian graphi- 
cal models. Let A, B and C be disjoint subsets of the 
vertex set V . The global bidirected Markov property 
for the graph G states that 

(2.10) Y A 1LY B \Y C 

if there does not exist a path in the graph that leads 
from a vertex in A to a vertex in B and has ev- 
ery nonendpoint vertex on the path in C. In the 
graph from Figure 1 (b) , it holds that Y\ _I_L I3 | Y2 
because the unique path from vertex 1 to vertex 3 
contains the vertex 4 that is not in the conditioning 
set {2}. For background regarding the Markov prop- 
erties of bidirected graphs see Pearl and Wermuth 
(1994), Kauermann (1996), Banerjee and Richardson 
(2003) and Richardson (2003). 

Both undirected and bidirected graphs have edges 
without directionality, and any two vertices are ei- 
ther joined by an edge or not. Consequently, in each 
case there exists a unique complete graph, that is, 
a graph in which all vertices are joined by an edge. 
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Moreover, the conditioning sets in the pairwise con- 
ditional independences (2.2) and (2.8) do not de- 
pend on the structure of the graph. For the acyclic 
directed graphs introduced next this is no longer 
true. Since a directed edge between two vertices i 
and j may be either i — ► j or i < — j, there no 
longer exists a unique complete graph, and the con- 
ditioning set in pairwise independence statements 
associated with missing edges will depend on cer- 
tain higher-order aspects of the graph. 

2.3 Directed Graphical Models 

A graph with directed edges i — ► j is called acyclic 
if it contains no directed cycles. A directed cycle is 
a path of the form % — ► • • • — ► i. In the literature, 
the term directed acyclic graph is often used to refer 
to these graphs. While this is somewhat imprecise 
terminology, it yields the popular acronym "DAG," 
which we also use here. 

Let G = (V, E) be an acyclic directed graph/DAG. 
The directed edges in E define a partial ordering =<; 
of the vertices V = { 1, . . . , p} in which i =<! j if i = j 
or there is a directed path i — ► • • • — ► j from i to 
j in G. Not all pairs of vertices must be comparable 
with respect to this partial ordering. For example, 
in the graph in Figure 1(c), vertices 2 and 3 are in- 
comparable. However, the partial order can always 
be extended (possibly nonuniquely) to a total or- 
der =<! under which i<j whenever i =4 j. (This may 
require renumbering the vertices.) Such a number- 
ing is called a well-numbering or topological ordering 
of the vertex set V. In the example, the vertex set 
is well-numbered, but exchanging vertex numbers 2 
and 3 also yields a well-numbering. In the sequel, we 
assume that the vertex set V is well-numbered. 

The well-numbered pairwise directed Markov prop- 
erty of G associates the conditional independence 

(2.11) Y t lLYj\Y {1 _ mhj} 

with all pairs 1 < i < j < p, for which the edge 

i — > j is absent from G; compare, for example, (7.2) 
in Edwards (2000) and (2.5) in Drton and Perlman 
(2007). Note that the well- numbering and the as- 
sumed order i < j preclude the existence of the edge 
i < — j. In the example of Figure 1(c), (2.11) spec- 
ifies that Yi 11 y 3 | y 2 , Y 2 11 Y-i | Yy and Y x _LL Y A \ 
(Y^j^j)- The Gaussian graphical model N(G) as- 
sociated with the DAG G is defined as the family 
of all p-variate normal distributions Ap(//, S) that 
obey the restrictions (2.11). 



The conditional independences (2.11) defining the 
model N(G) associated with a DAG G may at first 
sight appear to have a less clear interpretation than 
the ones associated with undirected or bidirected 
graphs. However, perhaps even the contrary is true, 
as the model N(G) based on the DAG G exhibits 
a dependence structure that can be expected if the 
(directed) edges in G represent cause-effect relation- 
ships. Thinking in this causal fashion, if there is no 
edge between vertices i and j, then Yi is not an im- 
mediate cause of Yj. Hence, if we condition on the 
variables Kri i ...j}\{jj}, which include all immediate 
(or direct) causes of Yj, then Yi should have no effect 
on Yj as stated in (2.11). 

Since we assume that Y follows a multivariate nor- 
mal distribution J\f p (fJ,, £), it holds that 



Pij-{i,...,j}\{i,j} - u, 

where for C C V \ {i,j} we define Pij.c to be the 
partial correlation of Yi and Yj given Yq. Clearly 
Pij.c is a function of the (C U {i,j}) x (C U 
submatrix of E; compare (2.3). 

The model N(G) can be shown to correspond to 
a system of linear regressions. For each variable Yi 
there is one linear regression in which Yi is the re- 
sponse variable and variables Yj with j — > i in G 
are the covariates ( Wermuth (1980), Andersson and 
Perlman (1998)). It can be shown that the set of 
covariance matrices giving rise to distributions in 
N(G) can be parametrized in terms of regression 
coefficients and residual variances, which constitute 
the factors of a Choleski decomposition of 

The definition of N(G) does not depend on the 
choice of the underlying well-numbering, which is 
not unique in general. This follows because a multi- 
variate normal distribution exhibits the conditional 
independences (2.11) stated by the well-numbered 
pairwise directed Markov property if and only if it 
obeys the conditional independences stated by the 
more exhaustive global directed Markov property, 
which does not depend on the choice of the well- 
numbering; see, for example, Cowell et al. (1999, The- 
orem 5.14), and Drton and Perlman (2007, Appendix 
A, Theorem 3). The global directed Markov prop- 
erty for a DAG can be formulated in two equivalent 
ways (Lauritzen (1996)). One way uses a connection 
to undirected graphs and their Markov properties 
via what is known as moralization, which involves 
path separation in certain augmented subgraphs of 
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the DAG. The other, which we detail next, is based 
on the so-called d-separation criterion which can be 
applied in the original DAG, but involves an ex- 
tended definition of separation in the DAG. 

Consider a path consisting of the vertices io,h, ■ ■ ■ , 
i m . A nonendpoint vertex i^, 1 < k < m — 1, on a 
path is said to be a collider if the preceding and suc- 
ceeding edges both have an arrowhead at i^, that is, 
the configuration ik—i — ► ik i — ^fc+i occurs in the 
path. A nonendpoint vertex which is not a collider 
is said to be a noncollider. Two vertices i and j are 
d-separated given a (possibly empty) set C ^ C) 
if every path between % and j is blocked relative to C, 
that is, there is either (i) a noncollider in C, or (ii) 
a collider c ^ C such that there is no vertex c E C 
with c — ► • • • — ► c in G. If A, B and C are disjoint 
subsets of the vertex set V, then A and B are said 
to be d-separated given C if every pair of vertices 
(i,j) with i E A and j E B is d-separated given C. 
The global directed Markov property of the DAG G 
states that 

(2.13) Y A ALY B \Y 

whenever A and B are d-separated given C. 

In the graph in Figure 1(c), the global Markov 
property implies that (Yi,!^) _LL Y% because the 
unique path from vertex 2 to vertex 3 contains the 
vertex 4, which is a collider that trivially satisfies 
property (ii) above because the conditioning set C 
is empty. In this example, it also holds that Y\ _LL 
(13,14) j I2 because vertex 2 is a noncollider on the 
unique path from vertex 1 to vertex 3 (and 4, resp.), 
and the conditioning set contains this noncollider. 

A difficulty in working with DAGs is the fact that 
two different DAGs can induce the same statistical 
model, in which case the two graphs 
are called Markov equivalent. A characterization of 
this equivalence can be found in 
Andersson, Madigan and Perlman (1997). However, 
if two different DAGs share a well-numbering, then 
they must induce different statistical models, which 
can be derived, for example, from Lemma 3.2 in 
Andersson, Madigan and Perlman (1997). 

2.4 Related Models With Graphical 
Representation 

Many other models considered in the literature 
benefit from a graphical representation. Most closely 
connected to the models discussed above are chain 
graph models. A chain graph is a hybrid graph fea- 
turing both directed and undirected edges. The name 



"chain graph" reflects the fact that the vertex set of 
these graphs can be partitioned into ordered blocks 
such that edges between vertices in the same block 
are undirected, and edges between vertices in dif- 
ferent blocks are directed, pointing from the lower- 
ordered block to the higher-ordered block. 

Two alternative Markov properties for chain graphs 
have been thoroughly studied in the literature: the 
LWF Markov property of Lauritzen and Wermuth 
(1989) and Frydenberg (1990), and the 
more recent AMP Markov property of 
Andersson, Madigan and Perlman (2001). Results 
on Markov equivalence and generalizations of the 
d-separation criterion can be found in 
Studeny and Bouckaert (1998), Studeny and Roverato 
(2006), Andersson and Perlman (2006) and 
Levitz, Perlman and Madigan (2001). Statistical in- 
ference for Gaussian chain graph models is discussed, 
for example, in Lauritzen (1996) and Drton and Eichler 
(2006). We note that chain graphs with bidirected 
instead of undirected edges can also be considered; 
see, for example, Wermuth and Cox (2004) where 
dashed edges are used in place of bidirected ones. 

Another class of graphical models are based on the 
ancestral graphs of Richardson and Spirtes (2002). 
These graphs may include undirected, directed and 
bidirected edges (although the undirected edges do 
not occur in an essential way) and permit one to 
represent all independence structures that may arise 
from a DAG model under conditioning and marginal- 
ization. Ancestral graphs are also related to path di- 
agram/structural equation models that are popular 
in econometrics and the social sciences; see, for ex- 
ample, Koster (1999). Finally, graphs have also been 
used to represent time series and stochastic process 
models (Dahlhaus (2000), Dahlhaus and Eichler (2003), 
Eichler (2007), Fried and Didelez (2003), 
Didelez (2007)). 

3. MODEL SELECTION BY MULTIPLE 
TESTING 

Let Y^\ . . . , y( n ) be a sample from a multivariate 
normal distribution J\f p (fJ,, S) in a Gaussian graph- 
ical model N(G), where G = (V,E) is an unknown 
undirected, bidirected or acyclic directed graph. The 
sample information can be summarized by the suf- 
ficient statistics, which are the sample mean vector 

(3.1) Y = -J2 Y(m) £ RV 

ft -. 

rn=l 
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and the sample covariance matrix 
1 n 

s = (y( m ) — Y)(Y( m ) — Y) 1 

nrt — 1 ^ — ' 



The problem we consider here is the recovery of the 
unknown graph underlying the assumed Gaussian 
graphical model. This is a problem of model selec- 
tion. We note that in this paper we consider the case 
where the sample size is moderate to large com- 
pared to the number of variables. More precisely, 
we assume that n > p + 1 in order to guarantee (al- 
most sure) positive definiteness of the sample co- 
variance matrix S. For work on problems in which 
the sample size is small compared to the number 
of variables, see, for example, Jones et al. (2005) or 
Meinshausen and Biihlmann (2006), where sparsity 
restrictions are imposed on the unknown graph. 

3.1 Model Selection and Hypotheses of 
Vanishing Partial Correlations 

As presented in Section 2, Gaussian graphical mod- 
els can be defined by pairwise conditional indepen- 
dence hypotheses or equivalently by vanishing of 
partial correlations. This suggests that we can per- 
form model selection, that is, recover the graph G, 
by considering the p(p — l)/2 testing problems 

H i:j : Pij. C (i,j) = vs. Kij : Pij. C (i,j) + 

(3.3) 

(l<i<j<p). 

For undirected graphs, (2.2) dictates choosing C(i, 
j) = V \ {i,j}, and for bidirected graphs we choose 
C(i,j) = in accordance with (2.8). For DAGs, 
(2.12) leads to the choice C(i,j) = {1, . . . , j} \ {i,j}, 
which, however, is valid only under the assumption 
that the vertex set V = {1, . . . ,p} is well- numbered 
for the unknown true DAG. 

Thus in order to be able to select a DAG via the 
testing problems (3.3), we must restrict attention 
to the situation where we have a priori information 
about a well-numbering of the vertex set of the un- 
known DAG. We then select a graph from the set of 
DAGs for which the specified numbering of the vari- 
ables is a well-numbering. In an application, a priori 
information about temporal or causal orderings of 
the variables can yield a known well-numbering. For 
example, Spirtes, Glymour and Schemes (2000, Ex- 
ample 5.8.1) analyze data on publishing productiv- 
ity among academics, which involve seven variables 



that obey a clear temporal order. In many other 
applications such a total order among the variables 
may not be available. However, as we mention in 
Section 6, a partial order is sufficient for selection of 
a chain graph (recall Section 2.4). 

If in the true graph G there is an edge between 
vertices i and j, then hypothesis Hij is false and 
the alternative Kij is true. Consequently, if we have 
performed the p{p — l)/2 tests of the hypotheses in 
(3.3), then we can select a graph by drawing an edge 
between i and j if and only if the hypothesis 
is rejected. Let a € (0, 1) be the significance level 
employed, and let tt^ be the p-value of the test of 
hypothesis Hy in (3.3). Then the graph G(a) that is 
selected at level a has the adjacency matrix A(a) = 
(aij(a)) £ W xp with entries 

^ %•(«)= ( 0) if 

In the sequel we will focus on addressing the issue 
of multiple testing in this approach, which leads to 
model selection procedures in which overall error 
rates (with respect to false inclusion of edges) can 
be controlled. 

We remark that testing the hypotheses in (3.3) is 
also the first step in stepwise model selection pro- 
cedures (Edwards (2000), Section 6.1; also see Sec- 
tion 3.1). In backward stepwise selection, for exam- 
ple, each hypothesis in (3.3) is tested individually 
at a fixed significance level a. The largest of the p- 
values for the hypotheses that are not rejected is de- 
termined and the associated edge is removed from 
the graph. In the next step the remaining edges/ 
hypotheses are tested again in the reduced graph, 
also at level a. The procedure stops if all remaining 
hypotheses are rejected at level a. While retesting 
in a reduced graph allows one to take advantage of 
sparsity of the graph, which induces independence 
features and may allow for more efficient parameter 
estimation and testing, the "overall error proper- 
ties [of such stepwise selection procedures] are not 
related in any clear way to the error levels of the 
individual tests" (Edwards (2000), page 158). 

3.2 Sample Partial Correlations 

A natural test statistic for testing hypothesis 
: Pij.cuj) = is the sample partial correlation 
r ij-C(i,j)i is, the partial correlation computed 
from the sample covariance matrix S; recall (2.3). 
The marginal distribution of Tij.c(i,j) nas t ne same 



8 



M. DRTON AND M. D. PERLMAN 



form as the distribution of the ordinary sample cor- 
relation rjj, but with the parameter pij replaced by 
Pij.CUj) and the sample size n reduced to nc(i t j) = 
n- \C(i,j)\ (Anderson (2003), Theorem 4.3.5). The 
marginal distribution of the sample correlation re- 
takes on a simple form if the ith and jth compo- 
nents of the normal random vector from which it is 
derived are independent. 

Proposition 1. // the true correlation p^ is 
zero, then y/n — 2 ■ rij/^Jl — rfj has a t- distribution 
with n — 2 degrees of freedom. 

In the noncentral case, pij ^ 0, the exact distri- 
bution of Tij can be described using hypergeometric 
functions, but it is simpler to work with Fisher's 
variance-stabilizing z-transform. 

Proposition 2. Let 

1 /l + r\ 
s:(-l,l)->R, r ^_i n ^__j 

be the z-transform. An accurate normal approxima- 
tion to the distribution of Zij = z(rij) can be obtained 
from the fact that 

\Jn — 3 (z^ — Qj) M(0, 1) as n — > oo, 

where = z(pij). Note that Qj = if and only if 
Pij = o. 

Concerning finite-sample properties, little is lost 
by working with the normal approximation to Fisher's 
z, as this approximation is accurate even for mod- 
erate sample size (Anderson (2003), Section 4.2.3). 
At the same time much convenience is gained be- 
cause of the variance-stabilizing property of the z- 
transform and the fact that the joint distribution of 
z-transformed sample correlations is easily deduced 
from that of the untransformed sample correlations; 
compare Proposition 5 below. 

A sample partial correlation rij.cn j) is a smooth 
function of the sample covariance matrix S. The ran- 
dom matrix (n— 1)5 has a Wishart distribution with 
n — 1 degrees of freedom and scale parameter ma- 
trix E. The asymptotic normal distribution of both 
S and S 1-1 can be described using Isserlis matrices 
(Olkin and Siotani (1976), Roverato and Whittaker 
(1998)). 

Proposition 3. Let Iss(S) be the Isserlis ma- 
trix ofT,, that is, the p{p + l)/2 x p(p-\- 1) /2-matrix 
with entries 

Iss(X])2j ^uv — 0~iuO~jv ~t~ 0~ivO~jui 

l<z<j"<p, l<ti<t;<p. 



Then 

y/n(S - S) A/" p(p+1)/2 (0, Iss(S)), 

and 

V^iS' 1 - E" 1 ) ^AA p(p+1)/2 (0,Iss(S- 1 )). 

Using the delta method (van der Vaart (1998)), 
the joint asymptotic normal distribution of the vec- 
tor of sample partial correlations r = (rij.cn j) | 1 < 
i < j <p) can be derived. For ordinary correlations, 
for which C(i,j) = for all 1 < i < j < p, and satu- 
rated partial correlations with C(i,j) = V\{i,j} for 
all 1 < i < j < p, the following result is quickly ob- 
tained using software for symbolic computation. The 
statement about ordinary correlations goes back to 
Aitkin (1969, 1971) and Olkin and Siotani (1976). 

Proposition 4. The vector of ordinary correla- 
tions is asymptotically normal, 

Vn(r- p) ™JVp( p _i)/2(0,O), 

with p = (p^ | 1 < i < j < p) and the asymptotic co- 
variance matrix O = (oJijki) given by 

= [1 — (Pij) 2 ] 2 i 
Uij,M = -\piiPnW ~ (Pij) 2 ~ (pu) 2 ~ (Pje) 2 ] 

+ Pji[ 1 ~ (Pij) 2 - (Pie) 2 }, 

^ij,k£ = \pijPki[(Pik) 2 + (Pu) 2 + (Pjk) 2 + (Pji) 2 )] 
+ PikPjl + PiiPjk — PikPjkPki 
— PijPikPil — PijPjkPjl — PilPjiPkl- 

The same result holds for the vector of saturated par- 
tial correlations Tij.y\Uj\ if we replace all p^j by 
Pij-V\{i,j} i n the above formulas, where, for more 
accurate normal approximation, the sample size n 
should also be replaced by n v\{i,j} = n — p — 2. 

For small number of variables p, asymptotic co- 
variance matrices for vectors of partial correlations 
r ij-{i,...,j}\{i,j}> as required for DAG selection, can be 
computed using software for symbolic computation, 
but we are not aware of any general formulas in the 
literature. 

By again applying the delta method, the following 
result is obtained: 

Proposition 5. The asymptotic covariance ma- 
trix of the vector of z -transformed partial correla- 
tions zij.c^i.j) = z(rij.c(ij)) is the correlation matrix 
of the asymptotic covariance matrix f2 of the vector 
of untransformed partial correlations rij.cn j\. 
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With these preliminaries we can now turn to the 
problem of error control in the multiple testing prob- 
lem (3.3), which we formulate in terms of p- values. 

3.3 Controlling Family-Wise Error Rate 

A multiple testing procedure for problem (3.3) is 
said to control the family-wise error rate (FWER) 
at level aE (0,1) if for any underlying multivariate 
normal distribution A/^,(/U,S), the probability of re- 
jecting one or more null hypotheses Hij incorrectly 
is smaller than or equal to a. If a multiple testing 
procedure controls the FWER at level a, then its (si- 
multaneous) p- values {iTij | 1 < i < j < p} have the 
property 

Probjy^s) (% : 7r^ < a but Hy is true) 
= Prob A/ ^ (Al S )(3i i :edge i — j included 

when actually absent) < a. 

Since control is achieved at all multivariate normal 
distributions, that is, at all patterns of true null hy- 
potheses, this is sometimes referred to as "strong 
control" (Dudoit, Shaffer and Boldrick (2003)). For 
the graph selected according to (3.4), this means 
that with probability at most a the selected graph 
G(a) is not a subgraph of the true graph G, 



(3.5) 



Prob G (G(a) £G)< 



a. 



(By definition, a subgraph of G contains no edges 
that are absent in G.) The notation Prob^ in (3.5) 
denotes any probability calculation under a distri- 
bution W p (/i, £) £ N(G). 

In (3.5), the error with respect to incorrect edge 
inclusion is controlled in finite samples. Sometimes it 
may, however, only be feasible to achieve asymptotic 
control of the form 

lim sup Probjy- ( At ,s)(3ij : T^ij < a but is true) < a, 
and hence, 



(3.6) 



lim sup Prob G (G (a) <£G)< 



ex. 



In particular, when using z-transformed correlations 
and normal approximations to their distributions, 
one may only hope to achieve asymptotic control 
given by (3.6). But the concept of asymptotic con- 
trol is also central to recently introduced multiple 
testing procedures that employ consistent estimates 
of the joint distribution of the test statistics (see 
Section 3.3.2). 



Any model selection procedure that asymptoti- 
cally controls FWER at fixed significance level a, 
that is, satisfies (3.6), is (1 — a)-consistent in the 
following sense. Let Gf aith fui = Gfaithfui(S) be the 
graph that has the fewest edges among all graphs G' 
for which the data-generating distribution M p (p, X) 
is in N(G'). The definition of 67f ait hf u i is straight- 
forward for undirected and bidirected graphs. For 
DAGs, recall that we assume to know a well-number- 
ing of the variables a priori, in which case Gf a ithfui is 
again well defined. The distribution M p (fi, X) is pair- 
wise faithful to G7f a ithfui in the sense that Pij.c(i,j) = 
Pij-C(i,j){^') = if and only if the edge between ver- 
tices i and j is absent from Gf a ;thfui- Then it can be 
shown that 



(3.7) liminf Probes) (G(a) = G 



faithful. 



> 1 



a: 



compare Drton and Perlman [2004, (2.18)]. This 
means that the selection procedure identifies Gf a i t hf u i 
with asymptotic probability at least (1 — a). More- 
over, if the sample size n can be chosen large enough, 
then the asymptotic probability Prob^^^) (G(a) / 
Gfaithfui) can be made arbitrarily small by choos- 
ing a arbitrarily small [Drton and Perlman (2004), 
(2.19)-(2.21)]. In this sense the procedure is fully 
consistent. However, the choice of n depends on a 
as well as on S. 

The above notion of faithfulness is defined with re- 
spect to the pairwise Markov property. In other con- 
texts (e.g., Spirtes, Glymour and Scheines (2000), 
Wille and Biihlmann (2006), Becker, Geiger and Meek 
(2000)), the stronger condition of faithfulness with 
respect to the global Markov property is considered. 
A distribution is globally faithful to a graph G if it 
exhibits a conditional independence Ya -U- Yb I 
if and only if the sets A, B and C fulfill the graphi- 
cal separation property used in the definition of the 
global Markov property. However, while every mul- 
tivariate normal distribution is pairwise faithful to 
some graph in the considered class of graphs, it is 
easy to see that there exist normal distributions that 
are not globally faithful to any graph in the class. 
For example, consider the centered trivariate normal 
distribution with covariance matrix 




1 
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Choosing the context of undirected graphs, we see 
that this distribution is pairwise but not globally 
faithful to the complete graph, since CT13 = implies 
YlALY 3 . 

3.3.1 Multiple testing procedures using the marginal 
distributions of the sample correlations. Classical 
generally applicable multiple testing procedures are 
based on the marginal distributions of the test statis- 
tics alone. When testing Hy : pij.c(i,j) = via the 
normal approximation to the z-transformed sample 
partial correlation 2y.c(ij) given in Proposition 2, 
we obtain the unadjusted p-value 



where again 7Ty is given by (3.8). The p- values 7r,^ ldak , 
which appear in Drton and Perlman [2004, (2.9)], 
can in turn be improved in a step-down approach to 



(3.8) Try = 2[1 - <&( Jnciij) 



3 • \zij. C (i,j)\ 



where <& is the cumulative distribution function of 
the standard normal distribution A/"(0, 1). These p- 
values can now be adjusted to achieve FWER con- 
trol (3.6) in the selection of the graph G(a) defined 
in (3.4). 
The Bonferroni p- values 



(3.9) 



TT 



Bonf 



mm 



1 < i < j < P, 



are the simplest such adjusted p-values. An easy, 
more powerful adjustment is obtained by the step- 
down method of Holm (1979). In this method the 
p-values TTij in (3.8) are ordered as ir^ < 7r 2 | < • ■ • 
and the ordered adjusted p- values are obtained as 



TT 



Bonf , Step 



(3.10) 



max 
6=l,...,o 



mm 



6 +1^,1 J 



l<a<© 



The adjusted p-value Tj-Bo^step j s then associated 
with the hypothesis Hij that gave rise to the ath 
smallest unadjusted p- value. Note that if the original 
p-values in (3.8) were computed using the 
i-transform in Proposition 1 instead of Fisher's 
z-transform, then both these p-value adjustments 
would provably achieve finite-sample error control 
(3.5). 

A more powerful procedure than Bonferroni is ob- 
tained by applying Sidak's inequality (Sidak (1967)) 
to the joint asymptotic normal distribution of the 
vector of transformed sample partial correlation co- 
efficients Zij.cuj) - The inequality yields the p- values 



(3.11) 



Sidak 
ij 



(l-7Ty)(a), l<i<j<p, 



TT 



Sidak, Step 
aT 



(3.12) 



max 
b=l,..., 



[1 



At 



l<a<© 



As in the case of the Bonferroni- adjusted p- values, 
the index a\ refers to the ath smallest p-value and 
the ordered adjusted p-value ?r Jj. d ' tep j s to be as- 
sociated with the hypothesis that gave rise to 
the ath smallest unadjusted p-value. Both sets of p- 
values, 7r^ ldak and 7Ty ld ' tep , define a graph G(a) 
that satisfies (3.6). 



3.3.2 Multiple testing procedures using the joint 
distribution of the sample correlations. Westfall and 
Young (1993) describe multiple testing methods that 
can improve upon the marginal distribution-based 
procedures from Section 3.3.1 by exploiting possible 
dependences among the test statistics used to test 
the individual hypotheses. However, these methods 
cannot be applied to testing of correlations, as the 
required condition known as "subset-pivotality" 
is not satisfied in this context (Westfall and Young 
(1993), page 43). A way around this condition was 
found recently by Pollard and van der Laan (2004), 
Dudoit, van der Laan and Pollard (2004) and van der 
Laan, Dudoit and Pollard (2004b), who describe how 
a consistent estimate of the asymptotic joint multi- 
variate normal distribution of the test statistics can 
indeed be used for a valid p-value adjustment. We 
now detail this approach in our context. 

For sample size n tending to infinity, our vector 
of test statistics z = (^y.c(ij) | 1 < « < J < p) has a 
multivariate normal limiting distribution which can 
be derived from that of r = (^y.c(ij) I 1 < i < 3 < p) 
using Proposition 5. We obtain the normal approx- 
imations 



(3.13) 



where p = (/Oy.c(ij) I 1 < * < 3 < p), C is the 
component- wise z-transform of p, and N is the diag- 
onal matrix with diagonal entries yf^c(ij) ~ 3- Re- 
call that for undirected and bidirected graphs with 
C(i,j) = V \ {i,j} and C(i,j) = 0, respectively, 
Proposition 3 yields the asymptotic covariance ma- 
trix f2, whereas for DAGs with C(i,j) = {1, . . . ,j}\ 
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Fig. 2. Simulated family-wise error rates for false edge inclusion in undirected graph model selection. The multiple testing 
procedures were set up to control the family-wise error rate at level a = 0.1 and 10,000 samples were drawn from a multivariate 
normal distribution for p = 7 variables; sample size varied from n = 25 to n = 500. In the normal distribution, nine partial 
correlations were nonzero with values in [0.2,0.55]. 



{i,j} no general formula for the asymptotic covari- 
ance matrix seems to be available. 

The asymptotic covariance matrix does involve 
unknown quantities derived from the covariance ma- 
trix £ of our observed random vectors. Plugging in 
the corresponding expression formed from the sam- 
ple covariance matrix S, we obtain a consistent esti- 
mator f2. We can then determine the so-called max- 
T adjusted p- values 



717 



Prob 



TV^/V- 1 Corr(Q)iV-*) 



(3.14) 



max 

l<M<i;<p 



\Zuv\ ^ \ z ij-C(i,j) I I) 



These probabilities can be computed by Monte Carlo 
simulation drawing the vector Z = (Z uv ) from JV(0, 
iV _1 Corr(0)iV~ < ). A step-down max-T procedure is 
also available. It is based on ordering the Zij.c(ij) as 
> | — " ' anc ^ yields the adjusted p- values 



7T 



max, Step 



b ™ ax a Pro V(o,JV- 



■Corr(n)iV-*) 



(3.15) 



max \Z H \ > \Zal\)- 



l<a<© 



[Note that since the asymptotic marginal distribu- 
tions of the z-transformed partial correlations 



z\r, 



v-C{i,j)) are identical, all being standard normal, 



the min-P adjustment is identical to the max-T ad- 
justment; see Dudoit, Shaffer and Boldrick (2003).] 
Both the single-step p-values vr™ ax and the less 

. _ j , max, Step j n 

conservative step-down p-vaiues ir^ define a 

graph G(a) for which the condition (3.6) for asymp- 
totic control of the FWER holds (Pollard and van der 
Laan (2004); Dudoit, van der Laan and 2004; van der 
Laan, Dudoit and Pollard 2004b). In fact, it follows 
from the general results in van der Laan, Dudoit 
and Pollard (2004b) that the step-down p-values 
^max, tep a g ra ph G(a) satisfying 



(3.16) 



lim Prob G (G(a) £ G) 



a. 



In other words, the asymptotic error control is not 
conservative but exact. 

The simulations summarized in Figure 2 show that 
the step-down max-T adjustment method based 
on (3.15) does indeed provide the most exact er- 
ror control for false edge inclusion. However, the 
step-down procedures based on marginal distribu- 
tions may still be useful if, for large number of vari- 
ables p, the Monte Carlo computation needed to 
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compute the max-T adjusted p-values becomes too 
time-consuming. 

3.4 Alternative Error Rates 

In multiple testing-based graphical model selec- 
tion, controlling the FWER amounts to controlling 
the probability that a single one of the edges in- 
cluded in the selected graph is incorrect. This is 
clearly a very stringent requirement. Alternatively, 
other less demanding error rates can be controlled, 
of which we now discuss the three most popular 
ones; see Dudoit, van der Laan and Pollard (2004), 
Romano and Wolf (2005) and van der Laan, Dudoit 
and Pollard (2004a, 2004b) for recent surveys of the 
relevant literature. 

One relaxation consists of controlling the general- 
ized family-wise error rate (GFWER), which is de- 
fined with respect to a chosen nonnegative integer 
k. This generalization is based on the probability of 
the event that at most k of the true null hypothe- 
ses are incorrectly rejected. In our graphical model 
selection context, if a multiple testing procedure is 
set up to control the fc-GFWER at level a £ (0, 1), 
and its adjusted p- values are used to select a graph 
G(a), then it holds that 

Probc(G(a) contains k + 1 or more 

(3.17) 

edges that are not present in G) <a. 

For k = 0, (3.17) reduces to (3.5), that is, control of 
the traditional FWER. 

In some contexts, one may be willing to live with 
a larger number of erroneous edge inclusion deci- 
sions if the selected graph is less sparse. This can 
be achieved by controlling the tail probability of the 
proportion of false positives, also known as false dis- 
covery proportion. Here a fraction A £ [0, 1) is chosen 
and the probability of the event that more than a 
proportion A of the rejected hypotheses are incor- 
rectly rejected is to be controlled. In the present 
context, control of the tail probability of the pro- 
portion of false positives (TPPFP) at level a allows 
us to select a graph G(a) with the property that 

Probe (More than 100A% of the edges 

(3.18) 

in G{a) are not present in G) < a. 

For A = 0, (3.18) reduces to (3.5). 

Of a somewhat different nature is the false dis- 
covery rate (FDR), which is defined in terms of the 
expectation of the proportion of false positives. Con- 
trolling the FDR at level a allows us to select a 



graph G{a) such that the proportion of incorrect 
edges among all the edges of G(a) is smaller than a 
in expectation, that is, 

|~#edges incorrectly included in G(a)~ 

Eg " 

#edges included in G(a) 

(3.19) 

<a. 

A number of methods for control of GFWER and 
TPPFP have been described in the literature and 
can be applied for graphical model selection. Per- 
haps the simplest methods are the augmentation 
methods of van der Laan, Dudoit and Pollard (2004a). 
The idea there is to first determine the hypothe- 
ses rejected in FWER control and then reject addi- 
tional hypotheses. For /c-GFWER control one sim- 
ply rejects k additional hypotheses from the most 
significant not already rejected ones. For A-TPPFP 
control the augmentation method proceeds similarly 
with the number of additionally rejected hypothe- 
ses determined from the parameter A and the num- 
ber of hypotheses already rejected in FWER con- 
trol. While simple and asymptotically exact if used 
in conjunction with the asymptotically exact max-T 
step-down procedure for FWER control [cf. (3.16)], 
the augmentation methods may sometimes be out- 
performed by methods that address the respective 
generalized error rate directly and not via FWER 
control; see Romano and Wolf (2005), who survey 
such methods that can be designed to employ either 
the joint distribution of the test statistics or only 
their marginals. 

For control of the FDR, the original step-up method 
of Benjamini and Hochberg (1995) is not generally 
applicable in our context as it requires the test statis- 
tics to exhibit a form of dependence termed "posi- 
tive regression dependency" (Benjamini and Yekutieli 
(2001)). A generally valid method is obtained by 
introducing a log-term as penalty in the step-up 
method (Benjamini and Yekutieli (2001)). Alterna- 
tively, van der Laan, Dudoit and Pollard (2004a) 
proposed a method for FDR control that is derived 
from TPPFP control. 

4. APPLICATION TO GENE EXPRESSION 
DATA 

In this section, we demonstrate multiple testing- 
based graphical model selection using data from n = 
118 microarray experiments collected and analyzed 
by Wille et al. (2004); see also Wille and Buhlmann 
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(2006). The experiments measure gene expression in 
Arabidopsis thaliana, and for our purposes we focus 
on p = 13 genes from the initial part of the MEP 
pathway, which is one of the two pathways that re- 
ceived special attention in Wille et al. (2004). We 
select graphical models by multiple testing in order 
to discover the key features of the correlation struc- 
ture among the considered gene expression measure- 
ments. Revealing such key features is important if 
the goal of the analysis is to generate scientific hy- 
potheses about the interplay of genes in a gene reg- 
ulatory network. First, in Section 4.1, we will select 
different types of graphs with the goal of emphasiz- 
ing how different graphs capture different types of 
dependence. Then, in Section 4.2, we give a simple 
example of the use of alternative error rates. 

4.1 Selecting Different Graphs 

In order to select an undirected graph we apply 
a multiple testing procedure to the testing prob- 
lem (3.3) with C(i,j) = V\{i,j}. Applying the step- 
down max-T procedure from Section 3.3.2 to con- 
trol the classical FWER at simultaneous significance 
level a = 0.15, we select the undirected graph de- 
picted in Figure 3, which has 16 edges. Control of 
the FWER allows us to state that we are 85% confi- 
dent that all the edges in this graph are also present 
in the true graph. In this example, the step-down 
max-T procedure is indeed the most powerful of 
the procedures described in Sections 3.3.1 and 3.3.2. 
For example, if i = DXPS1 and j = GPPS, then 

max.Step = Q Qgg but Bonf = q Qgg 

The undirected graph in Figure 3 shows some of 
the features of larger graphs shown in Wille et al. 
(2004), but there are also differences. Note, however, 
that the graphical modeling approach in Wille et al. 
(2004) is different from any of the approaches de- 
scribed here in that only conditional independences 
involving three variables are considered. 

Next we select a bidirected graph by testing (3.3) 
with C(i,j) = 0. Using again the step-down max-T 
procedure with simultaneous significance level a = 
0.15, we select the bidirected graph in Figure 3. 
The selected bidirected graph features 30 edges. As 
for the selection of the undirected graph, the step- 
down max-T procedure provides the most powerful 
method for FWER control. Use of any of the other 
multiple testing procedures from Sections 3.3.1 and 
3.3.2 results in the selection of a graph with 28 or 29 
edges. The two graphs in Figure 3 have some com- 
mon edges but many adjacencies are different, which 



is a reflection of the fact that large correlations are 
not necessarily associated with large partial correla- 
tions and vice versa. The two correlation measures 
quantify very different types of dependence. 

The vertical placement of the vertices in the graphs 
in Figure 3 reflects a partial order among the con- 
sidered genes that is based on the genes' role in the 
metabolic network (Wille et al. (2004), Figure 2). 
In order to illustrate selection of a DAG, we refine 
this partial order to a total order in which DXPS1 < 
DXPS2 < DXPS3 < DXR < • • • < IPPI1 < GPPS < 
PPDS1 < PPDS2. We then test the hypotheses (3.3) 
with C(i,j) = {1, . . . ,j} \ {i, j}, where the indices i 
and j refer to the rank of a gene in the total or- 
der. Since the asymptotic covariance matrix of the 
sample partial correlations used to test these hy- 
potheses is unknown, we use the step-down p- values 
^Sidak.Step tQ control FWER at i eve i a = o.i5. (Note, 

however, that bootstrap-based methods can be used 
to estimate the unknown joint distribution of the 
test statistics nonparametrically; compare 
Dudoit et al. (2004); van der Laan, Dudoit and Pol- 
lard, 2004a, 2004b; Romano and Wolf (2005).) The 
selected DAG, depicted on the left in Figure 4, has 
19 edges. Among the three graphs we selected, this 
DAG best reflects the structure of the metabolic net- 
work formed by the considered genes; recall, how- 
ever, that we used the structure of the metabolic 
network to form a well-numbering of the variables. 
We remark that a strict causal interpretation of this 
DAG (compare Section 2.3) would rest on the as- 
sumption that there are no hidden/unobserved causes 

4.2 Alternative Error Rates 

In order to convey how more liberal error rates al- 
low for the inclusion of additional edges, we consider 
the selection of DAGs under control of GFWER and 
TPPFP. Since we have already computed p-values 
for FWER control, the augmentation methods of 
van der Laan et al. (2004a) can be readily applied. 

For control of the /c-GFWER, we simply deter- 
mine the k smallest of the FWER p-values that 
are associated with hypotheses not rejected by the 
FWER controlling procedure. We then reject the k 
hypotheses corresponding to these p-values. Choos- 
ing k = 5 and keeping the simultaneous significance 
level a = 0.15 used above, we select the DAG with 
24 edges shown on the right-hand side in Figure 4. 
Due to the GFWER control we can state that we 
are 85% confident that at most five edges in this 
graph are not present in the true underlying graph. 




Fig. 4. Two DAG's selected by the step-down Sidak procedure. The graph to the left is obtained by controlling FWER at 
level a — 0.1, the one to the right by controlling k-GFWER with k = 5 at a — 0.1. 
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(We remark that a simple, more direct step-down 
approach to GFWER control described in 
Lehmann and Romano (2005), Theorem 2.2, leads 
to the same DAG.) 

TPPFP control by augmentation proceeds again 
by rejecting additional hypotheses not yet rejected 
by the procedure for FWER control. Keeping with 
a = 0.15 and choosing the proportion A = 0.22, the 
A-TPPFP control by augmentation again yields the 
graph on the right-hand side in Figure 4. Therefore, 
we are 85% confident that at most 22% of the edges 
of this graph are not present in the true underlying 
DAG. 

5. INCORPORATING PRIOR INFORMATION 
ABOUT THE PRESENCE OR ABSENCE OF 
EDGES 

Suppose it is known that in the true graph G = 
(V,E) certain edges Eq are absent, Eq n E = 0, 
and certain other edges E\ are present, E\ C E. 
Model selection then reduces to the problem of de- 
termining the absence or presence of the uncertain 
edges E u , that is, the complement of Eq U E\ in 
the set of all possible edges. Let G up = (V, E\ U E u ) 
denote the upper graph, which contains all edges 
known to be present as well as all uncertain edges. In 
the context of DAGs, Consonni and Leucari (2001) 
call the upper graph the "full graph." Similarly, let 
Glow = (V,E\) denote the lower graph, which in- 
cludes only the edges that are known to be present. 
Thus the true graph G satisfies G7i ow ^ G Q G up , 
where G?i ow and G7 up are known. If all edges are un- 
certain, then the upper and lower graph are the com- 
plete and the empty graph, respectively. 

The multiple testing approach presented in Sec- 
tion 3 extends readily to the present case by re- 
ducing the p(p — l)/2 simultaneous testing prob- 
lems from (3.3) to the q = \E U \ testing problems 
corresponding to the uncertain edges only Since q < 
p{p — l)/2, we have fewer testing problems to con- 
sider and gain power in simultaneous testing. Fur- 
thermore, since G C G up , the conditional indepen- 
dences holding in G? up also hold in G, which may 
allow for additional power gain because, as we ex- 
plain next, the hypothesis Hy : pij.cu,j) = may be 
reformulated equivalently using a smaller condition- 
ing set C up (i,j) C C(i,j). By working with a smaller 
conditioning set, the effective sample size ncttj) = 
n — \C(i,j)\ is increased; in this context see also 
Wille and Buhlmann (2006). In Sections 5.1 and 5.2, 



we detail this reasoning for undirected graphs and 
DAGs, respectively. For bidirected graphs, the con- 
ditioning set occurring in the testing problem (3.3) 
is already as small as possible as it is the empty set 
C(i,j)=0. 

5.1 Decreasing the Size of the Conditioning Set 
in Undirected Graphs 

The following graphical condition is the key to 
finding a smaller conditioning set C up (i,j) C C(i,j). 

Lemma 6. Suppose the observed random vector 
Y is distributed according to a multivariate normal 
distribution that is pairwise faithful to an undirected 
graph G = (V,E). Let i,j£Vbe two vertices and de- 
fine G %i to be the subgraph of G obtained by removing 
the edge i — j, which may or may not be present in 
G. Let C C V\{i,j} be a subset that separates i and 
j in G %3 . Then, 

Yi 11 Yj | Y V \ {iJ} YtAL Yj \ Y c . 

PROOF. (=*•): By the faithfulness assumption, 
G does not contain the edge i — j , so G = G lJ . Thus 
the global undirected Markov property for G (see 
Section 2.1) implies Yi _LL Yj | Yc- 

(<=)■ Let nb(j) be the set of vertices in V \ 
U C) that, in the graph G 13 , are connected 
to j by an edge. In G, the set C U {j} separates i 
and nb(j) and thus we obtain via the global Markov 
property for G that 

(5.1) YilLY^Yc^y 

Applying standard properties of conditional inde- 
pendence (Lauritzen (1996), Section 3), we obtain 
from (5.1) and the assumed Yi _LL Yj \ Yc that 

(5.2) YiALYj\Y CunbU) . 

Moreover, in the graph G, the set C U nb(j) U {i} 
separates j from the remaining vertices V\ ({i,j} U 
CUnb(j)). Hence, by the global Markov property 
for G, 

(5.3) Yj _LL lV\({ij}UCUnb(j)) I Ycunb{j)u{i}i 

which in conjunction with (5.2) implies that Yi _LL 
Yj | V\{i,j}. (Note that this proof could be reduced 
to a single application of the global Markov property 
if global faithfulness was assumed about the data- 
generating distribution. Under global faithfulness, 
Yi _LL Yj | Y c implies G = G ij .) □ 
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Fig. 5. Testing the edge 3 — 6 in an undirected upper graph. 

Consider now testing an uncertain edge i — j in 
E u by testing the hypothesis Hij : Pij-v\{i,j} = 0- Re- 
move the edge i — j from G U p to obtain the graph 
Gup- I n this known graph GJf p we can determine a 
subset C np (i,j) C V \ {i,j} that separates i and j. 
(Choosing this subset to be of minimal cardinality 
yields the largest gain in effective sample size.) We 
know a priori that the true data-generating distri- 
bution is pairwise faithful to an undirected graph 
G which is a subgraph of the known graph G up . 
Since graphical separation in G np implies graphical 
separation in the subgraph G, we can deduce from 
Lemma 6 that 

(5.4) Pij-v\{i,j} = <=> Pij-c up (i,j) = Q- 

As an example, consider as an upper graph the 
undirected graph in Figure 5. For testing the edge 
3 — 6, which is drawn dotted, we can test 
#36 : P36-12457 = but also simply H 3G : p 36 . i5 = or 
#36 : P36-57 = 0. 

Based on (5.4), the model selection testing prob- 
lem in (3.3) can be replaced by the problem of test- 
ing the q hypotheses 

: Pij. Cup (i,j) = vs. Kij : Pij. Cup (i,j) + 0, 

(5.5) 

(i,j)eE u . 

These hypotheses can again be tested using the cor- 
responding sample partial correlations ^y.c up (ij)j or 
rather the sample z-transforms ^ij.c up U,j)- Proposi- 
tion 5 still holds for the vector (^'.c u (jj) I (hj) e 
E u ). Instead, one could also work with more effi- 
cient maximum likelihood estimates computed for 
the model N(G np ). It should be noted, however, 
that the z-transform need no longer be variance- 
stabilizing when applied to such maximum likeli- 
hood estimates of partial correlations 
(Roverato (1996)). 

5.2 Decreasing the Size of the Conditioning Set 
in DAGs 

In a DAG with well- numbered vertex set, the con- 
ditioning set C(i,j) = {1, . . . , j} \ {i,j} may be re- 
duced to any subset C up (i,j) that d-separates i and 



j in the graph G l ^ p , defined to be the upper DAG 
G up with the edge i — > j removed. The validity of 
this replacement can be established using Lemma 7. 
We note that a simple choice for such a set C up (i,j) 
are the parents of j in G 1 ^, that is, the set of vertices 
k that are such that k — >j in GJ/ p . However, this 
need not be the d-separating set in G % 3 of smallest 
cardinality. 

Consider, for example, the DAG in Figure 6 as an 
upper DAG. For testing the edge 1 — 5, which is 
drawn dotted, we can use the parents of 5 to test 
#15 : P15-34 = but alternatively we can test Hi 5 : 

P15-2 = 0. 

Lemma 7. Suppose the observed random vector 
Y is distributed according to a multivariate normal 
distribution that is pairwise faithful to a DAG G = 
(V, E) with well-numbered vertex set. For any ver- 
tices i,j£V define G %3 to be the subgraph of G ob- 
tained by removing the edge i — > j, which may or 
may not be present in G. Let C C {1, . . . , j'} \ {i,j} 
be a subset that d-separates i and j in G lJ . Then, 

/r nX Y i 11 Y 3 I Y {l,-,j}\{i,j} 

(5.6) 

Yi 1L Yj | Y C . 

PROOF. By the faithfulness assumption, 

G does not contain the edge i — ► j, so G = G lJ . 
Thus the global directed Markov property for G (see 
Section 2.3) implies Yi _LL Yj \ Yq. 

(<=): Let G = (V,E) be the subgraph of G in- 
duced by {1, . . . , j} and set Y = Yu ^y. Then V is 
well-numbered for G and Y is pairwise faithful to 
G. Because C d-separates i and j in G 13 if and only 
if C d-separates i and j in G lJ and because (5.6) 
involves only Y, we may assume for the proof that 
G = G and Y = Y. Note that j is now a terminal 
vertex in G; that is, j has no children in G. 

Let pa(j') be the set of parents of j in G lJ . We 
claim that pa(j') \ C and i are d-separated in G given 
C. For, any path 7 between i and some k G pa(j') \ C 
in G either includes j as a nonendpoint or does not. 




Fig. 6. Testing the edge 1 — > 5 in a DAG. 
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In the first case, since j is terminal in 67 it must be a 
collider in 7 with no directed path to any c E C, so 7 
is blocked relative to C in 67. In the second case, con- 
sider the extension of the path 7 given by appending 
the edge k — > j. By the assumed d-separation of i 
and j in 67 1 - 7 given C and the fact that k C is a 
noncollider in the extended path, the path 7 must 
again be blocked relative to C in G. This establishes 
the claim. 

Hence, by the global directed Markov property 
for G, 

(5.7) YilLY^dYc 

Because Y has a multivariate normal distribution, 
it follows from (5.7) and the assumed independence 
Y il Yj I Y c that Yi 11 (Yj,Y pa ^\ c ) | Y c , hence 

(5.8) YiALY^Yc^y 

Since j is terminal in 6?, the set C U pa(j) U {i} 
d-separates j from the remaining vertices {1, . . . 
({i, j} U C U pa(j')) in 67. Therefore, by the global 
Markov property for G, 

(5.9) Yj _LL j}\({ij}uCUpa(j)) I *CUpaO»U{i} . 

which in conjunction with (5.8) implies that Yi _I_L 
Yj I (This proof could be reduced to a 

single application of the global Markov property if 
global faithfulness were assumed for Y.) □ 

Remark 8. The proof of implication (-^=) in 
Lemma 6 holds for any distribution, not necessarily 
Gaussian, that obeys the global Markov property of 
the graph G. This is in contrast to the proof of impli- 
cation (<=) in Lemma 7, where we have employed 
a special property of the multivariate normal distri- 
bution when deducing (5.8). This special property, 
namely the fact that Yi _LL Yj and Yi _LL implies 
Yi _LL (Yj,Yk), is crucial. For example, it is easy to 
choose a joint distribution for a binary random vec- 
tor (Y lt Y 2 , Y 3 y such that Y\ II Y 2 and Y 1 AL Y 3 but 
Y\ AL (12,13). This distribution is then pairwise but 
not globally faithful to the DAG 1 — ► 3 < — 2, and 
since Y\ AL I3 | Y%, it yields a contradiction to the 
claim of Lemma 7 for binary random variables. A 
way around an assumption of global faithfulness is 
to apply Lemma 6 to an undirected graph 67™ such 
that N(G up ) C N(G™ p ). Such a graph 67™ can be 
obtained via the moralization procedure (Lauritzen 
(1996)). 



6. DISCUSSION 

Gaussian graphical models are determined by pair- 
wise (conditional) independence restrictions, which 
are in correspondence to the edges that are absent 
from the underlying graph. These restrictions can 
be converted into a set of hypotheses that can be 
tested in order to select a model, or equivalently, a 
graph. If the arising issue of multiple testing is ap- 
propriately addressed, then the selection of a graph 
can be performed while controlling error rates for 
incorrect edge inclusion. As reviewed in Section 3, 
the literature provides a number of methods for such 
error rate control. 

In graphical model selection, controlling incorrect 
edge inclusion allows us to detect the most impor- 
tant features of multivariate dependence patterns. 
However, the graph encoding these features need not 
necessarily yield a model that fits the data well, and 
if the choice of such a model is the primary focus, 
then other model selection methods (e.g., the score- 
based and Bayesian methods discussed in Section 1) 
may be preferable. 

The number of edges in the graph to which the 
true data-generating distribution is pairwise faith- 
ful equals the number of false null hypotheses Hij 
in (3.3). It would be interesting to adapt existing 
methods for estimating or bounding this latter num- 
ber (see, e.g., Meinshausen and Biihlmann (2005)) 
to make them applicable in graphical model selec- 
tion. This would allow us to assess the sparseness of 
the underlying graph by estimating or bounding the 
number of its edges. Such knowledge could also be 
used to design more powerful multiple testing-based 
methods of graph selection in an empirical Bayes 
framework (see, e.g., Efron et al. (2001)). 

In order to associate unique null hypotheses with 
acyclic directed graphs (DAGs), we restricted our- 
selves to the situation where a well-numbering of the 
variables is known a priori. Requiring the knowledge 
of such a total order is clearly very restrictive. More 
commonly, time of observation of variables and other 
considerations provide a priori knowledge in form of 
a partial order, which allows us to identify ordered 
blocks of variables. Such blocking strategies appear 
in many case studies (Caputo, Heinicke and Pigeot 
(1999), Caputo et al. (2003), Didelez et al. (2002), 
Mohamed, Diamond and Smith (1998)); compare 
also Wermuth and Lauritzen (1990). In our illustra- 
tion of DAG selection in Section 4 the structure of 
a metabolic pathway yields a partial order among 
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genes that we extended rather arbitrarily to a to- 
tal order. Hence, a more appropriate analysis might 
proceed by using the ordered blocks of variables and 
a generalization of the multiple testing-based model 
selection we described in order to select a Gaussian 
chain graph model; see Drton and Permian (2007) 
for details on this generalization. 

For the class of ancestral graphs of Richardson 
and Spirtes (2002), multiple testing-based model se- 
lection as presented here is less natural. The rea- 
son is that in this class there is no distinguished 
complete graph whose edges could be tested for ab- 
sence/presence by testing associated conditional in- 
dependences. If a particular complete ancestral graph 
is chosen, then a subgraph could be selected simi- 
larly as for the other models. The pairwise Markov 
property (Richardson and Spirtes (2002), page 979) 
would yield the conditional independences that would 
have to be tested. However, this procedure could not 
assure that the selected ancestral graph is maximal 
(Richardson and Spirtes (2002), page 978). Gaus- 
sian models associated with nonmaximal ancestral 
graphs cannot in general be specified in terms of 
conditional independence. 

Finally, many of the ideas presented here in the 
framework of Gaussian models carry over to the case 
of discrete variables or even the mixed case of dis- 
crete and continuous variables. Background on the 
distributional assumptions in the mixed case can be 
found, for example, in Lauritzen (1996). However, 
when adapting the methods reviewed here, care must 
be taken, as conditional independence in the multi- 
variate normal distribution exhibits special proper- 
ties not shared by other distributional settings. In 
particular, moving from conditional independence 
statements between pairs of random variables to ones 
involving sets of random variables may be valid in 
a multivariate normal distribution but not in other 
distributions; compare Remark 8. 
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